Abstract

In this paper we consider mesh connected computers with multiple buses,
providing broadcast facilities along rows and columns. A tight bound of $\Theta(n^{1/8})$
is established for the number of rounds required for semigroup computations on
n values distributed on a 2-dimensional rectangular mesh of size n with a bus on
every row and column. The upper bound is obtained for a skewed rectangular
mesh of dimensions $n^{3/8} \times n^{5/8}$. This result is to be contrasted with the tight
bound of $\Theta(n^{1/6})$ for the same problem on the square ($n^{1/2} \times n^{1/2}$) mesh [PR].
This implies that in the presence of multiple buses, a skewed configuration may
perform better than a square configuration for certain computational tasks.
Our result can be extended to the d-dimensional mesh, giving a lower bound
of $\Omega(n^{1/(d2^d)})$ and an upper bound of $O(d2^{d+1} n^{1/(d2^d)})$.

1  Introduction


The mesh organization is considered an attractive and practical architecture for parallel
processing. The main desirable features of this organization are threefold: it
has a simple, modular interconnection pattern, which makes it easy to construct and
program; it naturally corresponds to the data format of many useful problems in matrix
computations and image processing; and it is amenable to VLSI implementation
[D, KLW, Kr, Re, TK, U]. A basic example of this architecture is an arrangement
of the processors on integral points on the plane in a rectangular form where each
processor is connected by a bidirectional communication link to its immediate neighbors
on the vertical and horizontal axes. Information passes through these links in
unit time. Typical tasks assigned to a computer based on the mesh architecture (a
mesh-connected computer) involve an assignment of data items to each of the processors
in the mesh and a global computational requirement involving all of the data
stored at the processors. This computational requirement may entail the need to sort
the elements, find certain order statistics on them (such as their maximum), or
compute basic functions such as partial sums and products. Typical applications are
presented in, e.g., [C, CDL, G, Ko].

The main drawback of the mesh architecture is its large diameter. Since information
flow is one of the major factors affecting processing time on a parallel machine, a
large diameter implies long delays even when relatively low traffic loads are required,
since certain data items may need to be moved over long distances. For instance, in a
square mesh of size n as described above, a data item may travel a distance of $\Theta(\sqrt{n})$
in the worst case. This implies long processing time for various basic computational
tasks.

A possible approach for overcoming the problem of long-distance data movements
is to design a parallel machine based on the mesh configuration and extend it with
a broadcast mechanism that will enable fast data transfers. Such a mechanism can
be implemented using a bus, or a collection of buses. This approach was proposed
in [B, G, JS, S1], which consider the addition of a single global bus to the mesh.
It is assumed that the mesh operates synchronously using a central clock. At the
beginning of each time step a processor may send a message along any or all of its
links, and also send a broadcast message on the global bus. Processors receive all
messages sent to them within the same time unit, and may perform some internal
computation. We assume that at most one message can be broadcast on the bus at
any given time. While the assumption of immediate broadcast is unrealistic, since it
assumes that the propagation time of messages on the bus is independent of the size
of the network, for practical situations the difference may be justifiably ignored.

While a global bus enables us to overcome sporadic instances in which a long-distance
data movement is required, it does not solve all data flow problems. In
particular, when many data items need to be transferred over long distances, using
the single bus will create a bottleneck and increase the processing time.
In view of this observation it was proposed in [PR, Ra, S2] to augment the mesh
computer by adding multiple buses. In particular, it was suggested to include a bus
for each row and column of the mesh. In a mesh with multiple buses, a processor
may locally communicate with its four neighbors or broadcast a message on the bus
connecting its row or column. Again we make the assumption that such a broadcast
takes unit time and that at most one message may be broadcast on each bus at any
given time.

We may consider the addition of multiple buses to d-dimensional meshes for any
d  >  1.  In  such  a  mesh  each  processor  has  2d  links  connecting  it  to  its  2d  immediate 
neighbors.  (A  processor  may  have  fewer  than  2d  neighbors  if  it  is  located  on  the 
“edges”  of  the  mesh.)  In  addition,  each  processor  belongs  to  d  buses,  one  for  each 
dimension. 

Virtually all of the papers cited above assume a square configuration for the mesh.
That is, a mesh of n processors is assumed to have dimensions $n^{1/2} \times n^{1/2}$. This
assumption (or rather, “design decision”) is fully justified for meshes without buses,
because for such meshes the diameter is minimized by choosing the square
design. However, when multiple buses are added to the mesh, this consideration
becomes less important. At first glance, one may argue that since the architecture
remains symmetric with respect to its two dimensions, a square configuration should
still be preferable as far as time complexity goes. The results described in this paper
indicate that this is not the case. In fact, it turns out that in the presence of
multiple buses, a skewed rectangular configuration may perform better than a square
configuration for certain computational tasks.

We concentrate on the problem of semigroup computations, which is an important
representative of the types of problems suited for a mesh, and was considered in
several of the papers mentioned above. Assume that each processor p has a value
a(p) taken from an infinite domain $\mathcal{A}$. An associative binary operation “+” is defined
on $\mathcal{A}$ (for simplicity of terminology we refer to “+” as addition). The task is to
compute the sum $A = \sum_p a(p)$, where the summation is over all the processors in the
mesh. Examples of such functions are addition, multiplication and maximum.
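
As a small illustration of the problem statement (not part of the mesh algorithms themselves), the following Python sketch treats a semigroup computation as a fold with an associative operation. The helper name `semigroup_sum` and the sample values are assumptions made for this example only.

```python
from functools import reduce
import operator


def semigroup_sum(values, op):
    """Combine all values with an associative binary operation op."""
    return reduce(op, values)


values = [3, 1, 4, 1, 5, 9, 2, 6]

total = semigroup_sum(values, operator.add)    # addition
product = semigroup_sum(values, operator.mul)  # multiplication
largest = semigroup_sum(values, max)           # maximum

# Associativity means the inputs may be grouped arbitrarily, so partial
# results can be formed independently and combined afterwards.
left, right = values[:4], values[4:]
regrouped = operator.add(semigroup_sum(left, operator.add),
                         semigroup_sum(right, operator.add))
```

The regrouping at the end is the property the mesh algorithms rely on: disjoint groups of values are summed separately and the partial sums are then summed in turn.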

Semigroup computations were analyzed for meshes with a single global bus and
multiple buses. Bokhari [B] gives an $O(n^{1/3} \log n)$ time algorithm for computing maximum
on a 2-dimensional mesh with a single global bus. This result was extended
to higher dimensions and shown to be optimal by Aggarwal and Stout [A, S1]. They
established that for the d-dimensional mesh with a single global bus, semigroup computations
require $\Theta(n^{1/(d+1)})$ time.

As for square 2-dimensional meshes with multiple buses, Prasanna Kumar and
Raghavendra [PR] give a tight bound of $\Theta(n^{1/6})$ for the problem.

Our main result is that for semigroup computations the square design is not
optimal. We give a tight bound of $\Theta(n^{1/8})$ on the number of rounds needed to compute
an n-valued semigroup function on a 2-dimensional rectangular mesh with row- and
column-buses. The upper bound is obtained for a skewed mesh of dimensions
$n^{3/8} \times n^{5/8}$. We also generalize our result to meshes of any number of dimensions d > 1. For
d-dimensional meshes (with buses along each dimension) we present a lower bound
of $\Omega(n^{1/(d2^d)})$ and an upper bound of $O(d2^{d+1} n^{1/(d2^d)})$ on the time complexity of semigroup
computations. These bounds are tight for fixed d with n tending to infinity. The
dimensions $n = r_1 \times \cdots \times r_d$ for which the upper bound is obtained are defined
as follows. Let $r = n^{1/(d2^d)}$ (for simplicity assume that r is an integer). For every i
($1 \le i \le d$) let $s_i = 2^{i-1}d + 1$ and define $r_i = r^{s_i}$.
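
The choice of side lengths can be checked numerically. The Python sketch below computes the exponents $s_i = 2^{i-1}d + 1$ and the sides $r_i = r^{s_i}$ for given d and r; the function names `mesh_dimensions` and `mesh_size` are hypothetical helpers introduced for this illustration.

```python
def mesh_dimensions(d, r):
    """Return (s, dims) with s_i = 2**(i-1) * d + 1 and r_i = r**s_i, i = 1..d."""
    s = [2 ** (i - 1) * d + 1 for i in range(1, d + 1)]
    dims = [r ** s_i for s_i in s]
    return s, dims


def mesh_size(dims):
    """Number of processors: the product of the side lengths."""
    n = 1
    for side in dims:
        n *= side
    return n


# d = 2, r = 2: exponents 3 and 5, i.e. an 8-by-32 mesh with n = 2**8.
s, dims = mesh_dimensions(2, 2)
n = mesh_size(dims)
```

Since the exponents add up to $d(2^d - 1) + d = d2^d$, the product of the sides is $r^{d2^d} = n$, which is why r is taken to be $n^{1/(d2^d)}$.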

The  results  for  d  >  3  are  merely  of  theoretical  interest,  since  from  a  practical  point 
of view only 2- and 3-dimensional meshes will conceivably become feasible in future
technologies.  Nonetheless,  we  feel  that  the  observation  conveyed  by  our  bounds  is  of 
general  interest  in  its  own  right. 

The rest of the paper is organized as follows. Section 2 presents some notation
and definitions needed for our algorithms. The algorithms for the 2-dimensional mesh
and the d-dimensional mesh are presented in Sections 3 and 4, respectively. In Section 5
we present the lower bound for the d-dimensional mesh for every d > 2. Throughout
the rest of this paper we refer to the architecture of a mesh with multiple buses simply
as a mesh, and say basic mesh when referring to a mesh without buses.

2  Preliminaries


The 2-dimensional mesh is a rectangular array of processors of dimensions $x \times y$,
where n = xy is the number of processors on the mesh. Denote the processors by $p_{ij}$
for all $0 \le i \le y-1$ and $0 \le j \le x-1$, and denote their values by $a_{ij}$. The rows and
the columns of the mesh are denoted by $R_0, \ldots, R_{y-1}$ and $C_0, \ldots, C_{x-1}$ respectively.

For every i and j where $0 \le i \le y-1$ and $0 \le j \le x-1$, the processor $p_{ij}$
is connected by communication links to its four neighbors $p_{(i-1)j}$, $p_{(i+1)j}$, $p_{i(j-1)}$ and
$p_{i(j+1)}$. These links enable direct message transmissions between neighbors. The four
corner processors have two neighbors and the other processors
on the buses $C_0$, $C_{x-1}$, $R_0$ and $R_{y-1}$ have three neighbors. (All of our results hold
for meshes with wrap-around, i.e., in which the processors in column $C_0$ and row
$R_0$ are connected to their corresponding processors in column $C_{x-1}$ and row $R_{y-1}$
respectively.) Where no confusion arises, we use $R_i$ and $C_j$ to denote either the set of
processors they contain or the names of the appropriate row-buses and column-buses
that pass through them.

For the d-dimensional mesh we need more definitions. Let $n = r_1 \times r_2 \times \cdots \times r_d$
be the size of the d-dimensional mesh, where $1 \le r_1 \le r_2 \le \cdots \le r_d$. For simplicity
we select all the $r_i$'s to be of the form $r^{s_i}$ for some parameter r, and therefore n is
also a power of r.

For every nonnegative integer x define $Z_x = \{0, \ldots, x-1\}$.

A processor in the d-dimensional mesh is represented by a d-vector $(c_1, c_2, \ldots, c_d)$
where $c_i \in Z_{r_i}$ for $1 \le i \le d$. Its input value is denoted by $a(c_1, \ldots, c_d)$.
The basic mesh connections are as follows. For every i ($1 \le i \le d$), if $c_i < r_i - 1$
(respectively, $0 < c_i$) then processor $(c_1, \ldots, c_i, \ldots, c_d)$ is connected by a link to
processor $(c_1, \ldots, c_i + 1, \ldots, c_d)$ (respectively, $(c_1, \ldots, c_i - 1, \ldots, c_d)$).

Given subsets $A_i \subseteq Z_{r_i}$ for every $1 \le i \le d$, denote by $(A_1, \ldots, A_d)$ the set of
processors $\{(x_1, \ldots, x_d) \mid x_i \in A_i,\ 1 \le i \le d\}$. When $A_i$ is a singleton {a} we
sometimes replace it by its member, a, for clarity.

A bus is a 1-dimensional submesh of the mesh. Every bus is defined by a dimension
i, $1 \le i \le d$, and d - 1 constants $c_j \in Z_{r_j}$ for $1 \le j \le d$, $j \ne i$. Such a bus connects
the processors of the set $(c_1, \ldots, c_{i-1}, Z_{r_i}, c_{i+1}, \ldots, c_d)$. The set $\mathcal{B}_i$ is the set of all buses
defined by the i'th dimension.


3  The  algorithm  for  the  2-dimensional  mesh 

3.1  Outline 

In this section we present our algorithm for the 2-dimensional mesh. We set a global
parameter $r = n^{1/8}$ (for simplicity we assume that r is an integer) and select the
dimensions of the mesh to be $x = r^5$ and $y = r^3$. During the execution of the
algorithm the values get grouped and summed together into some specially designated
processors, called the active processors, and the values they hold are called active
values. The algorithm is defined in such a way that in any given stage, each input
value “occurs” in exactly one currently active value, so the sum of all the active values
gives the correct result. At the beginning all the processors are active and at the end
only processor $p_{00}$ is active.

The algorithm is composed of eight stages, some of which are split into two substages.
Each stage reduces the number of active processors by a factor of r. This is
done by partitioning the active values into disjoint sets of cardinality r, and summing
each into one active value. Each substage takes at most r rounds, and is performed in
its entirety using either the links or the buses, but not both. In case the summation
is done by the links, the r active values of each set must be at distance at most r
from the processor to which they need to be summed. In case the summation is done
by the buses, the r values of each set must be located on the same bus and must be
the only active values on this bus. To meet these requirements for links or buses
the algorithm uses distribution operations on the active values, which take at most r
rounds. Again, if the distribution is done by links then no active value can be
sent over a distance greater than r, while if the distribution is done by buses then each
bus used for this operation contains no more than r active values.


3.2  The  basic  procedures 

We now describe four basic procedures on meshes with buses, performing the four
operations discussed above. All four procedures use the global parameter r, which
equals $n^{1/8}$ in the 2-dimensional case.

Procedure SUMLINK(B)

Input: The parameter B is a bus containing the processors $q_0, \ldots, q_{k-1}$, where $k = \ell r$.
It is assumed that all of the processors are active, and they hold the active values
$a_0, \ldots, a_{k-1}$ respectively.

We think of the bus as partitioned into $\ell$ consecutive segments of length r, with the
j'th segment consisting of $q_{jr}, \ldots, q_{jr+r-1}$. For every $j \in Z_\ell$, the procedure sums the
values of the j'th segment and stores the result, $\sum_{i=0}^{r-1} a_{jr+i}$, into $q_{jr}$. This operation
is performed using the links only, by sequentially accumulating the values along the
segment, starting from $q_{jr+r-1}$ and going towards $q_{jr}$, and requires r - 1 rounds. (See
Figure 1a. Boldface dots represent active processors.)

Output: There are $\ell$ active values on B, stored at the active processors $q_{jr}$, $j \in Z_\ell$.

Procedure SUMBUS(B)

Input: The parameter B is a bus containing the processors $q_0, \ldots, q_{k-1}$, of which
exactly r processors $q_{i_0}, \ldots, q_{i_{r-1}}$ are active, and hold the active values $a_{i_0}, \ldots, a_{i_{r-1}}$
respectively.

This procedure sums all r active values in r rounds using only the bus. In the
j'th round, $1 \le j \le r$, processor $q_{i_{j-1}}$ broadcasts the value $a_{i_{j-1}}$ on the bus B.

Output: Processor $q_0$ is designated as the only active processor on B, setting its
active value to be $\sum_{j=0}^{r-1} a_{i_j}$. (See Figure 1b.) (Note that in fact, all processors on B
know this active value.)

Procedure DISTBUS(B)

Input:

(1) The parameter B is a bus containing the processors $q_0, \ldots, q_{k-1}$. On B there are
exactly $m = \ell r$ active processors $q_{i_0}, \ldots, q_{i_{m-1}}$ that hold the active values
$a_{i_0}, \ldots, a_{i_{m-1}}$ respectively.

(2) If $B = R_i$ (respectively, $B = C_j$) then define $B_0, \ldots, B_{\ell-1}$ to be the $\ell$ buses
$R_i, \ldots, R_{i+\ell-1}$ (respectively, $C_j, \ldots, C_{j+\ell-1}$). The processors on the bus $B_t$ are
denoted by $q^t_0, \ldots, q^t_{k-1}$. The bus B is the only one among $B_0, \ldots, B_{\ell-1}$ that has active
values.

This procedure distributes the m active values among the buses $B_0, \ldots, B_{\ell-1}$ such
that each bus will contain exactly r active values. In case $B = R_i$ (respectively,
$B = C_j$) the distribution is made by the buses $C_{i_0}, \ldots, C_{i_{m-1}}$ (respectively,
$R_{i_0}, \ldots, R_{i_{m-1}}$).

Output: The value $a_{i_t}$ is held by processor $q^{\lfloor t/r \rfloor}_{i_t}$, which belongs to the bus $B_{\lfloor t/r \rfloor}$. (See
Figure 1c.)

Since this procedure is never used concurrently for parallel buses it follows that
each bus distributes at most one value. Hence this procedure requires only one round.

Procedure DISTLINK(B)

This procedure is essentially the same as DISTBUS(B). The only difference
is that the distribution is carried out using the links rather than the buses. Since
it takes $\ell - 1$ rounds for active values to reach $B_{\ell-1}$, it follows that this procedure
requires $\ell - 1$ rounds. (See Figure 1d.)


Note that one can reduce the number of rounds required by procedures SUMLINK
and DISTLINK by a factor of roughly 2, i.e., it is possible to sum r values
(respectively, distribute $\ell$ values) in about $r/2$ (resp., $\ell/2$) rounds. However, for a clearer
description of the algorithm we prefer the above formulation.


3.3  The  algorithm 

Before describing the algorithm for the 2-dimensional mesh we demonstrate the usage
of these procedures for the 1-dimensional mesh. This mesh is equipped with a single
bus denoted B, and the procedures are defined setting $r = n^{1/2}$.

Algorithm 1-DIM

1.  SUMLINK(B);

2.  SUMBUS(B);

It is easy to verify that algorithm 1-DIM is correct and requires $O(n^{1/2})$ rounds,
which is optimal by [S1].
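
The two steps of Algorithm 1-DIM can be replayed sequentially. The Python sketch below is an illustration only: it models a bus as a list, counts rounds as stated for the two procedures (r - 1 for SUMLINK, r for SUMBUS), and the function names `sumlink`, `sumbus` and `one_dim` are labels chosen for this example rather than a real parallel implementation.

```python
import math
import operator


def sumlink(values, r, op):
    """Fold each consecutive segment of length r into its first position.

    Returns the segment sums and the number of rounds used (r - 1)."""
    assert len(values) % r == 0
    sums = []
    for start in range(0, len(values), r):
        segment = values[start:start + r]
        acc = segment[-1]
        # accumulate from the far end of the segment towards its first processor
        for v in reversed(segment[:-1]):
            acc = op(v, acc)
        sums.append(acc)
    return sums, r - 1


def sumbus(active_values, op):
    """One broadcast per round; listeners fold every value they hear."""
    acc = active_values[0]
    for v in active_values[1:]:
        acc = op(acc, v)
    return acc, len(active_values)


def one_dim(values, op):
    """Algorithm 1-DIM on n = r * r values: SUMLINK, then SUMBUS."""
    r = math.isqrt(len(values))
    assert r * r == len(values)
    partial, link_rounds = sumlink(values, r, op)
    total, bus_rounds = sumbus(partial, op)
    return total, link_rounds + bus_rounds


result, rounds = one_dim(list(range(16)), operator.add)
```

For n = 16 the sketch uses (r - 1) + r = 2r - 1 = 7 rounds with r = 4, in line with the $O(n^{1/2})$ bound.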

We now present the algorithm for the 2-dimensional mesh. Recall that for two
dimensions we have $r = n^{1/8}$, $x = r^5$ and $y = r^3$. The algorithm is composed of a
sequence of twelve substages, each involving the parallel execution of one of the above
procedures on several buses. During the execution of the algorithm the set ACTIVE
is the set of all active processors (i, j) (recall that this pair represents the processor
$p_{ij}$). In order to clarify the flow of the algorithm we specify, for each of the stages,
the set of active processors after executing that stage and its cardinality #A. In
particular, at the beginning of the run ACTIVE contains all the possible pairs and
#A = n, and at the end of the algorithm ACTIVE contains only the pair (0,0) and
#A = 1. Figure 2 depicts the flow of the algorithm for a 32 × 8 mesh (r = 2).

Stage   Operation                                                     ACTIVE after the stage                                              #A

0.      (input)                                                       $(Z_{r^5}, Z_{r^3})$                                                $r^8$
1.      for $i \in Z_{r^3}$ do SUMLINK($R_i$);                        $\{(ir, j) \mid i \in Z_{r^4},\ j \in Z_{r^3}\}$                    $r^7$
2.      for $j \in \{0, r, \ldots, r^5 - r\}$ do SUMLINK($C_j$);      $\{(ir, jr) \mid i \in Z_{r^4},\ j \in Z_{r^2}\}$                   $r^6$
3.1.    for $j \in \{0, r, \ldots, r^5 - r\}$ do DISTLINK($C_j$);     $\{(j, (j \bmod r)r^2 + ir) \mid i \in Z_r,\ j \in Z_{r^5}\}$       $r^6$
3.2.    for $j \in Z_{r^5}$ do SUMBUS($C_j$);                         $(Z_{r^5}, 0)$                                                      $r^5$
4.      SUMLINK($R_0$);                                               $\{(ir, 0) \mid i \in Z_{r^4}\}$                                    $r^4$
5.1.    DISTBUS($R_0$);                                               $\{(ir^2 + jr, i) \mid i \in Z_{r^3},\ j \in Z_r\}$                 $r^4$
5.2.    for $i \in Z_{r^3}$ do SUMBUS($R_i$);                         $(0, Z_{r^3})$                                                      $r^3$
6.1.    DISTBUS($C_0$);                                               $\{(j, jr + i) \mid i \in Z_r,\ j \in Z_{r^2}\}$                    $r^3$
6.2.    for $j \in Z_{r^2}$ do SUMBUS($C_j$);                         $(Z_{r^2}, 0)$                                                      $r^2$
7.1.    DISTBUS($R_0$);                                               $\{(ir + j, i) \mid i, j \in Z_r\}$                                 $r^2$
7.2.    for $i \in Z_r$ do SUMBUS($R_i$);                             $(0, Z_r)$                                                          $r$
8.      SUMBUS($C_0$);                                                $(0, 0)$                                                            $1$

Observe that we can omit stages 6.1 and 7.1, since after summing on a bus all
the processors on the bus know the result, including, in particular, the processor
designated as active after these stages. Straightforward counting reveals that the
number of rounds required by Algorithm 2-DIM is 9r - 1 (or 9r - 3 if stages 6.1 and
7.1 are omitted), which is $O(n^{1/8})$. In Section 5 we give a matching lower bound.

It remains to prove correctness. Specifically, we need to show that at the end of
the run the only active processor is $p_{00}$ and its value, $a_{00}$, is indeed the desired value
$\sum_{i,j} a_{ij}$. This requires us to prove the following properties for each of the stages:

1.  The distribution of active values on the mesh at the beginning of the stage is
compatible with the requirements of the procedure applied in this stage.

2.  Whenever a procedure is activated in parallel on several buses, these activations
do not interfere with each other (i.e., each processor participates in at most one
activation of the procedure).

3.  The set of active processors at the end of each stage is as specified in the above
table.

All of these properties follow in a straightforward way from the definitions of the
procedures and are left for the reader to verify.
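
Part of this verification can be mechanized. The Python sketch below replays the twelve substages sequentially on a mesh with $r^5$ columns and $r^3$ rows, tracking only where the active values sit and how many rounds each substage is charged. It rests on several assumptions that are ours rather than the paper's: processors are written as (column, row) pairs, the operation is commutative as well as associative, buses and links are not modelled individually, and the name `run_two_dim` is hypothetical.

```python
import operator


def run_two_dim(values, r, op):
    """Replay the substages; values maps (column, row) -> input value.

    Returns (result, rounds, sizes), where sizes lists the number of
    active values after each substage."""
    assert len(values) == r ** 8
    active = dict(values)
    rounds = 0
    sizes = []

    def combine(key, cost):
        # SUMLINK / SUMBUS: values with equal key(p) are summed into key(p)
        nonlocal active, rounds
        merged = {}
        for p in sorted(active):
            k = key(p)
            merged[k] = op(merged[k], active[p]) if k in merged else active[p]
        assert len(merged) * r == len(active)   # reduction by a factor of r
        active, rounds = merged, rounds + cost
        sizes.append(len(active))

    def relocate(target, cost):
        # DISTLINK / DISTBUS: each active value moves to processor target(p)
        nonlocal active, rounds
        moved = {target(p): v for p, v in active.items()}
        assert len(moved) == len(active)        # no two values collide
        active, rounds = moved, rounds + cost
        sizes.append(len(active))

    combine(lambda p: (p[0] - p[0] % r, p[1]), r - 1)        # 1.   SUMLINK, rows
    combine(lambda p: (p[0], p[1] - p[1] % r), r - 1)        # 2.   SUMLINK, columns
    relocate(lambda p: (p[0] + p[1] // r ** 2, p[1]), r - 1)  # 3.1  DISTLINK
    combine(lambda p: (p[0], 0), r)                          # 3.2  SUMBUS, columns
    combine(lambda p: (p[0] - p[0] % r, 0), r - 1)           # 4.   SUMLINK, row 0
    relocate(lambda p: (p[0], p[0] // r ** 2), 1)            # 5.1  DISTBUS
    combine(lambda p: (0, p[1]), r)                          # 5.2  SUMBUS, rows
    relocate(lambda p: (p[1] // r, p[1]), 1)                 # 6.1  DISTBUS
    combine(lambda p: (p[0], 0), r)                          # 6.2  SUMBUS, columns
    relocate(lambda p: (p[0], p[0] // r), 1)                 # 7.1  DISTBUS
    combine(lambda p: (0, p[1]), r)                          # 7.2  SUMBUS, rows
    combine(lambda p: (0, 0), r)                             # 8.   SUMBUS, column 0
    return active[(0, 0)], rounds, sizes


r = 2
inputs = {(c, w): c + 32 * w for c in range(r ** 5) for w in range(r ** 3)}
result, rounds, sizes = run_two_dim(inputs, r, operator.add)
```

For r = 2 the replay charges 9r - 1 = 17 rounds and the active set shrinks from 256 values to 1. The sketch checks only counts and collisions, so it complements rather than replaces the three properties listed above.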


4  The algorithm for the d-dimensional mesh

4.1  Outline 

In this section we present Algorithm d-DIM for the d-dimensional mesh for arbitrary
d > 2. This algorithm is a generalization of Algorithm 2-DIM of the previous section.

First let us define the dimensions $r_1, \ldots, r_d$ of the mesh. Define $r = n^{1/(d2^d)}$ (again
for simplicity assume that r is an integer). For every i ($1 \le i \le d$) let $s_i = 2^{i-1}d + 1$
and define $r_i = r^{s_i}$. Note that $\prod_{i=1}^{d} r_i = n$, so the mesh is of size n.

As in Algorithm 2-DIM some of the processors are active in the sense that only
their values need to be summed. In each stage of the algorithm the number of active
processors is reduced by a factor of $r^s$ for some integer s. Each such stage requires at
most dsr rounds, and makes use of one of three operators $\mathrm{SUM}_i$, i = 1,2,3, defined
in the next section.

In order to describe our later constructions it is convenient to define some special
submeshes. For every i ($1 \le i \le d$) and for every j ($1 \le j \le i$) define the following
sets of processors:

1. $V_{j,i} = (Z_{r_1}, \ldots, Z_{r_{j-1}}, 0, \ldots, 0, Z_{r_{i+1}}, \ldots, Z_{r_d})$.

Thus $V_{j,i}$ is the submesh obtained by restricting the dimensions j through i
($j \le i$) to the point 0 and taking all points on all other dimensions. In particular,

2. $W_i = V_{1,i-1} = (0, \ldots, 0, Z_{r_i}, \ldots, Z_{r_d})$.

3. $U_i = \{(0, \ldots, 0, x_i, \ldots, x_d) \mid x_j \in Z_{r_j} \text{ and } r \text{ divides } x_j \text{ for } i \le j \le d\}$.

Thus $U_i$ is a “sparse” submesh of $W_i$ containing every r'th point in dimensions i
through d. There is an implicit correspondence between each point in $U_i$ and the
$r \times \cdots \times r$ “subcube” it belongs to, and we refer to this point as “representing”
its subcube.

Note that the set $W_1$ is the set of all processors. Also observe that all the buses
in $\mathcal{B}_i$ intersect the set $V_{i,i}$ in exactly one processor, and the set $\mathcal{B}_i$ is exactly the set
of all buses that are not contained in the set

4.2  The $\mathrm{SUM}_i$ operations

The algorithm uses three operators of the form $X = \mathrm{SUM}_i(Y)$, for i = 1,2,3. The
sets Y and X are the sets of all active processors before and after the operator is
applied, respectively, and are, generally, submeshes in one of the forms $W_i$, $U_i$ or $V_{j,i}$.
Let us now describe how the operators $\mathrm{SUM}_i$ work.

1) $U_i = \mathrm{SUM}_1(W_i)$

This operator sums the values in every $r \times \cdots \times r$ subcube (on the dimensions i
through d) of the submesh $W_i$ into the point representing it in the sparse submesh
$U_i$. More formally, the processor $(0, \ldots, 0, x_i, \ldots, x_d)$, where $x_j \in Z_{r_j}$ is a multiple of r
for $i \le j \le d$, receives as its new active value the sum

$$\sum_{y_i, \ldots, y_d \in Z_r} a(0, \ldots, 0, x_i + y_i, \ldots, x_d + y_d).$$

The summation is performed using only the links, by d - i + 1 applications of the
procedure SUMLINK, starting with the i'th dimension and ending with the d'th
dimension. More precisely, the following code is executed.

for j = i to d do
    for every bus B in $\mathcal{B}_j$ containing active values do
        SUMLINK(B)

The operator requires (d - i + 1)(r - 1) rounds and the number of active values
is reduced by a factor of $r^{d-i+1}$.


2) $V_{i,i} = \mathrm{SUM}_2(U_i)$

The summation is done in two phases. In the first phase the active values are
distributed in a way that on each bus in the set $\mathcal{B}_i$ there are exactly r active values.
The second phase involves applying procedure SUMBUS on the buses of $\mathcal{B}_i$.

For the distribution phase we need a generalized version of the procedures DISTBUS
and DISTLINK. In the 2-dimensional case all the buses perpendicular to the
given bus B can distribute its active values. In the d-dimensional case the procedures
must get an additional parameter j indicating the dimension of the distribution. Thus
the distribution is done by applying the generalized procedure DISTBUS(B, j) in dimensions
j = 1, ..., i - 1 and then applying the generalized procedure DISTLINK(B, j)
in dimensions j = i + 1, ..., d - 1. All the distributions are done on the buses of $\mathcal{B}_d$
that have active values. We omit the exact description of the generalized procedures,
which is straightforward, but present the description of the operator $\mathrm{SUM}_2$.

for j = 1 to i - 1 do
    for every bus B, $B \in \mathcal{B}_d$ and B has active values do
        DISTBUS(B, j)
for j = i + 1 to d - 1 do
    for every bus B, $B \in \mathcal{B}_d$ and B has active values do
        DISTLINK(B, j)
for every bus B, $B \in \mathcal{B}_i$ do
    SUMBUS(B)

The distribution on the buses takes i - 1 rounds, the distribution on the links
takes (d - i - 2)(r - 1) rounds and the summation on the buses of $\mathcal{B}_i$ takes r rounds.
Altogether, the operator requires (d - i - 1)r + (2i - d + 1) rounds. The number of
active values is reduced by a factor of r.

3) $V_{j,i} = \mathrm{SUM}_3(V_{j+1,i})$

The operator consists of $s_j$ phases, each reducing the number of active values by
a factor of r. After an odd phase, t, the active processors are

$(Z_{r_1}, \ldots, Z_{r_{j-1}}, 0, Z_{r^{s_j - t}}, 0, \ldots, 0, Z_{r_{i+1}}, \ldots, Z_{r_d})$.

After an even phase, t, the active processors are

$(Z_{r_1}, \ldots, Z_{r_{j-1}}, Z_{r^{s_j - t}}, 0, \ldots, 0, Z_{r_{i+1}}, \ldots, Z_{r_d})$.

Each odd (respectively, even) phase is performed by first applying the generalized
procedure DISTBUS(B, j) on the buses of $\mathcal{B}_j$ in dimension j + 1 (respectively, on
the buses of $\mathcal{B}_{j+1}$ in dimension j) and then applying procedure SUMBUS on the
buses of $\mathcal{B}_j$ (respectively, $\mathcal{B}_{j+1}$). The exact description is as follows.

for t = 1 to $s_j$ do
    if t is odd then
        for every bus B in $\mathcal{B}_j$ containing active values do
            DISTBUS(B, j + 1)
        for every bus B in $\mathcal{B}_j$ containing active values do
            SUMBUS(B)
    if t is even then
        for every bus B in $\mathcal{B}_{j+1}$ containing active values do
            DISTBUS(B, j)
        for every bus B in $\mathcal{B}_{j+1}$ containing active values do
            SUMBUS(B)


As noted after the description of Algorithm 2-DIM, the distribution part is not
needed. Therefore the operator $\mathrm{SUM}_3$ requires $s_j r$ rounds. The number of active
values is reduced by a factor of $r^{s_j}$.


4.3  The  algorithm 

In order to illustrate the usage of the operators $\mathrm{SUM}_i$ let us first provide a different,
equivalent formulation of Algorithms 1-DIM and 2-DIM, which makes use of these
operators. Recall that $W_1$ is always the set of all processors, and in the 1-dimensional
(respectively, 2-dimensional) case $W_2$ (resp., $W_3$) contains only the processor (0,0).

Algorithm 1-DIM                                     Algorithm 2-DIM

1.  $U_1 = \mathrm{SUM}_1(W_1)$;                    1.  $U_1 = \mathrm{SUM}_1(W_1)$;

2.  $W_2 = V_{1,1} = \mathrm{SUM}_2(U_1)$;          2.  $W_2 = V_{1,1} = \mathrm{SUM}_2(U_1)$;

                                                    3.  $U_2 = \mathrm{SUM}_1(W_2)$;

                                                    4.  $V_{2,2} = \mathrm{SUM}_2(U_2)$;

                                                    5.  $W_3 = \mathrm{SUM}_3(V_{2,2})$;


The d-DIM algorithm is a generalization of the above presentation.

Algorithm d-DIM


1.  $U_1 = \mathrm{SUM}_1(W_1)$;

2.  $W_2 = V_{1,1} = \mathrm{SUM}_2(U_1)$;

3.  for i = 2 to d do

    (a)  $U_i = \mathrm{SUM}_1(W_i)$;

    (b)  $V_{i,i} = \mathrm{SUM}_2(U_i)$;

    (c)  for j = i - 1 down to 2 do

         $V_{j,i} = \mathrm{SUM}_3(V_{j+1,i})$;

    (d)  $W_{i+1} = V_{1,i} = \mathrm{SUM}_3(V_{2,i})$;

Let us calculate the number of rounds required by the algorithm. Except for
Stage 3a, whenever the number of active processors is reduced by $r^s$ for some s, the
reduction takes sr + O(d) rounds. Moreover, for every i ($1 \le i \le d$), Stage 3a requires
(d - i - 2)r additional rounds. Altogether, algorithm d-DIM requires fewer than

$$d2^{d+1} r = d2^{d+1} n^{1/(d2^d)}$$

rounds.

In order to prove the correctness of the algorithm, one needs to check that all three
operators $\mathrm{SUM}_i$ are correct according to their specifications; it follows immediately
that the whole algorithm works properly. Correctness of the $\mathrm{SUM}_i$ operators follows
from the special way we selected the sizes of the mesh in each dimension. Formal
verification is tedious but straightforward, and is omitted from the paper.

5  The lower bounds

The main result of this section is a proof that every algorithm for semigroup computation
on a rectangular mesh with buses takes at least $T = \Omega(n^{1/8})$ steps, where n
is the number of processors in the mesh. The proof technique is a generalization of
similar lower bounds for the 1-dimensional mesh and the square 2-dimensional mesh
[S1]. At the end of this section we extend this result to d-dimensional meshes for
d > 2.

The proof is based on bounding from above the maximum number of distinct
input values that an active value may “cover” in each step of the algorithm. Since
our semigroup functions are “globally sensitive,” in the sense that any single input
may be changed so as to affect the final result, we sometimes say that a processor p
“knows” some subset of the inputs, meaning that it has their sum.

The basic idea is best demonstrated by reviewing the proof in [S1] for the
1-dimensional case. In this case all $n$ processors are on the same bus. By the end of
round $t$, for $0 \le t \le T$, every processor has received at most $1 + 2t$ distinct values
through the links. Only one processor can use the bus at each round $t$, and by doing
so it can tell all other processors about at most $1 + 2(t - 1) = 2t - 1$ new values
(unknown to them up until now). Thus at time $T$ a processor may have received
at most $2T + 1$ values through the links and $\sum_{t=1}^{T} (2t - 1)$ values through the bus.
Altogether it knows at most

$$(2T + 1) + \sum_{t=1}^{T} (2t - 1) = (T + 1)^2$$

input values. This number must be at least $n$, hence $T = \Omega(n^{1/2})$.
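The counting argument for the 1-dimensional case is easy to check numerically. The following is a minimal sketch: it adds the link term $2T + 1$ to the bus term $\sum_{t=1}^{T}(2t - 1)$, confirms that the total equals $(T + 1)^2$, and finds the smallest $T$ for which this count reaches $n$. The function names are hypothetical and chosen only for this illustration.

```python
import math


def known_after(T: int) -> int:
    """Upper bound on the inputs one processor may know after T rounds (1-D mesh, one bus)."""
    via_links = 2 * T + 1                               # own value plus T neighbours per side
    via_bus = sum(2 * t - 1 for t in range(1, T + 1))   # one broadcast per round
    return via_links + via_bus


def min_rounds_1d(n: int) -> int:
    """Smallest T for which the counting bound permits knowing all n inputs."""
    T = 0
    while known_after(T) < n:
        T += 1
    return T


if __name__ == "__main__":
    # The closed form (T + 1)^2 holds for every T tried.
    assert all(known_after(T) == (T + 1) ** 2 for T in range(200))
    for n in (16, 100, 10_000):
        print(n, min_rounds_1d(n), math.isqrt(n))
```

The smallest admissible $T$ tracks $\sqrt{n}$, which is the $\Omega(n^{1/2})$ behaviour derived above.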

For the 2-dimensional case assume that the mesh size is $x \times y$ where $n = xy$.
Without loss of generality let $x \le y$.

Straightforward counting reveals that by the end of round $t$, for $0 \le t \le T$, a
processor has received at most $4\binom{t+1}{2} + 1$ distinct input values through the links
(including its own input value). For the derivation of our first inequality we make the
over-permissive assumption that every value sent on a bus arrives at all $n$ processors
(for “free”). Therefore, in round $t$ a processor may receive, through the $x + y$ buses,
at most $(x + y)\left(4\binom{t}{2} + 1\right)$ distinct new values. Consequently, at the end of round $T$
a processor may know at most

$$\left(4\binom{T+1}{2} + 1\right) + \sum_{t=1}^{T} (x + y)\left(4\binom{t}{2} + 1\right)$$

input values, where the first term accounts for values received through the links and
the second for those received through the buses. This sum, which is $O(T^3(x + y)) =
O(T^3 y) = O(T^3 n/x)$, must be at least $n$, hence

$$T^3 = \Omega(x). \tag{1}$$
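The link-count formula can be checked by brute force. The sketch below enumerates the grid points within Manhattan distance $t$ and compares the count with $4\binom{t+1}{2} + 1$. It also evaluates the link-plus-bus total under the assumption (ours, mirroring the 1-dimensional argument) that each of the $x + y$ buses carries at most $4\binom{t}{2} + 1$ new values in round $t$. All function names are hypothetical.

```python
from math import comb


def link_ball_2d(t: int) -> int:
    """Count grid points within Manhattan distance t of a processor, by enumeration."""
    return sum(
        1
        for i in range(-t, t + 1)
        for j in range(-t, t + 1)
        if abs(i) + abs(j) <= t
    )


def known_bound_2d(T: int, x: int, y: int) -> int:
    """Link term plus bus term after T rounds on an x-by-y mesh (assumed bus term)."""
    links = 4 * comb(T + 1, 2) + 1
    buses = sum((x + y) * (4 * comb(t, 2) + 1) for t in range(1, T + 1))
    return links + buses


def min_rounds_first_inequality(x: int, y: int) -> int:
    """Smallest T for which the counting bound reaches n = x * y."""
    T = 0
    while known_bound_2d(T, x, y) < x * y:
        T += 1
    return T


if __name__ == "__main__":
    assert all(link_ball_2d(t) == 4 * comb(t + 1, 2) + 1 for t in range(30))
    for x, y in ((64, 64), (8, 512), (2, 2048)):
        print(x, y, min_rounds_first_inequality(x, y))
```

Running the loop for increasingly skewed shapes with the same $n$ shows how this inequality alone weakens as $x$ shrinks.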

If the mesh is square, i.e., $x = y = n^{1/2}$, the last equation implies that $T = \Omega(n^{1/6})$.
However, one can choose a small value for $x$, and then this bound on $T$ is too weak.
Therefore we need to derive a second inequality. For that purpose we may again make
a permissive assumption, asserting that a value known to a processor is also known to
all other processors on the same row (for free). This implies that we do not need the
row-buses. Moreover, assume that the goal function is to sum only the input values
of one column, say, $C_0$, so there are only $y$ input values. Arguments similar to those for the
1-dimensional case show that in round $t$, for $0 < t \le T$, at most $2t - 1$ new values
can be sent on each column-bus. There are $x$ such buses, so necessarily

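$$(1 + 2T) + x \sum_{t=1}^{T} (2t - 1) \ge y,$$

which implies that

$$T^2 = \Omega\left(\frac{y}{x}\right). \tag{2}$$

Combining Equations (1) and (2) we get

$$T^8 = (T^3)^2 \cdot T^2 = \Omega\left(x^2 \cdot \frac{y}{x}\right) = \Omega(xy) = \Omega(n),$$

or

$$T = \Omega\left(n^{1/8}\right).$$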
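The interplay of the two inequalities can be illustrated numerically. Reading (1) as $T^3 \ge x$ and (2) as $T^2 \ge y/x$ with all constants dropped (a simplifying assumption), an $x \times y$ mesh needs roughly $T \ge \max(x^{1/3}, (y/x)^{1/2})$ rounds. The sketch below scans power-of-two shapes for a fixed $n$ and reports the shape where this quantity is smallest. The function names are hypothetical.

```python
def lower_bound_shape(x: int, y: int) -> float:
    """Constant-free lower bound max(x^(1/3), (y/x)^(1/2)) for an x-by-y mesh."""
    return max(x ** (1 / 3), (y / x) ** 0.5)


def best_power_of_two_shape(log_n: int):
    """Scan shapes x = 2**k, y = n / x with x <= y; return (k, bound) minimising the bound."""
    n = 2 ** log_n
    candidates = []
    for k in range(0, log_n // 2 + 1):
        x = 2 ** k
        candidates.append((lower_bound_shape(x, n // x), k))
    bound, k = min(candidates)
    return k, bound


if __name__ == "__main__":
    log_n = 24
    k, bound = best_power_of_two_shape(log_n)
    print(f"n = 2^{log_n}: weakest bound at x = 2^{k}, T >= about {bound:.3f}")
```

For $n = 2^{24}$ the minimum falls at $x = 2^9 = n^{3/8}$, where both terms are about $n^{1/8} = 8$; every other shape gives a larger value, which is why the combined bound is $\Omega(n^{1/8})$ regardless of the shape.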


Before reading the derivation for the general case, the reader may find it instructive
to consider the 3-dimensional case. Assume that $n = xyz$ and that $x \le y \le z$.
By arguments similar to the 2-dimensional case we derive three inequalities. The first is

(l+6(g))  +  +  +  3^))  >xyz, 

which implies that

$$T^4 = \Omega(x). \tag{3}$$


The  second  inequality  is 

$$\left(1 + 4\binom{T+1}{2}\right) + \sum_{t=1}^{T} x(y + z)\left(1 + 4\binom{t}{2}\right) \ge yz,$$

which implies that $T^3 = \Omega(y/x)$. Multiplying this by equation (3) we get

$$T^7 = \Omega(y). \tag{4}$$


The third inequality is

$$(1 + 2T) + xy \sum_{t=1}^{T} (2t - 1) \ge z,$$

which implies that $T^2 = \Omega(z/(xy))$. Multiplying this by equations (3) and (4) we get

$$T^{13} = \Omega(z). \tag{5}$$

Multiplying equations (3), (4) and (5) we get that $T^{24} = \Omega(n)$ and thus

$$T = \Omega\left(n^{1/24}\right).$$


We conclude this section by presenting the inequalities for any dimension $d > 2$.
Assuming that $n = r_1 r_2 \cdots r_d$ and that $r_1 \le r_2 \le \cdots \le r_d$, the following inequalities
can be derived.


shH'-Kr.iPs'-- 


(r,  >  rj. 


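From these inequalities we get that for every $j$ in the range $1 \le j \le d$, the
following holds:

$$T^{d-j+2} = \Omega\left(\frac{r_j}{\prod_{i=1}^{j-1} r_i}\right). \tag{6}$$

When $j = 1$ the denominator is 1. By appropriate multiplications of equations from
(6) we get that for every $j$ in the range $1 \le j \le d$

$$T^{2^{j-1} d + 1} = \Omega(r_j). \tag{7}$$

Multiplying all the equations in (7) we conclude that $T^{d 2^d} = \Omega(n)$ and thus

$$T = \Omega\left(n^{1/(d 2^d)}\right).$$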
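The exponent bookkeeping in this argument can be checked mechanically. The sketch below assumes that the $j$-th inequality has the form $T^{d-j+2} = \Omega(r_j / (r_1 \cdots r_{j-1}))$, which matches the 2-dimensional and 3-dimensional cases worked out above. It multiplies each inequality by the earlier bounds to obtain an exponent $e_j$ with $T^{e_j} = \Omega(r_j)$, then checks that the exponents sum to $d\,2^d$. The function name is hypothetical.

```python
from typing import List


def exponents(d: int) -> List[int]:
    """Return [e_1, ..., e_d] with T**e_j = Omega(r_j) under the assumed inequality form."""
    e: List[int] = []
    for j in range(1, d + 1):
        # Inequality j contributes T**(d - j + 2); multiplying by the earlier
        # bounds T**e_i = Omega(r_i), i < j, clears the denominator r_1 ... r_{j-1}.
        e.append((d - j + 2) + sum(e))
    return e


if __name__ == "__main__":
    for d in range(1, 8):
        e = exponents(d)
        assert e == [2 ** (j - 1) * d + 1 for j in range(1, d + 1)]
        assert sum(e) == d * 2 ** d
        print(d, e, sum(e))
```

For $d = 2$ this gives exponents 3 and 5 (total 8), and for $d = 3$ it gives 4, 7 and 13 (total 24), in agreement with the bounds $\Omega(n^{1/8})$ and $\Omega(n^{1/24})$.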
Figure 1: The four procedures ($r = 4$). Panels: a) SUMLINK(B); c) DISTBUS(B); d) DISTLINK(B), ($B = R_i$).